February 22, ’23

talk overview

  • Applications

  • Objectives

  • Methods

  • Interpretation

applications - conservation

  • aid in discovery of new populations of imperiled plants

  • aid in creation of reserves under climate change models

  • aid in predicting joint species distributions, i.e. obligate mutualisms

  • aid in modelling spread of noxious species to novel ranges (but see Liu et al. (2020))

applications - academic

  • develop candidate species for metabarcoding
  • fine scale models to study co-existence
  • used in population biology (but see Lee-Yaw et al. (2022))

objectives

using known occurrences of a species, identify areas which have similar habitat and the potential to support populations

but, what about dispersal?

competition?

mutualisms?

an example species to accompany our study

Besseya (= Synthyris) alpina (A. Gray) Rydberg.

American Basin

B. alpina, Franklin #3948

methods - overview

define spatial domain and grain

software environments

dependent variables

independent variables

modelling approaches

model evaluation

predicting a model into space

domain and grain

domain: spatial extent of the study
- administrative boundary
- ecological model

grain: scales in space and time
- resolution at which the process occurs (space)
- current and past climate (time)
- projected climates
- (animals) seasonal patterns?

limitation: compute power

domain and grain

Domain

software environments

  • R
    • sf vector data à la the tidyverse Pebesma (2018)
    • terra raster data without headaches Hijmans (2022)
    • sdm modelling operations Naimi and Araujo (2016)
    • caret ML / data partitioning Kuhn (2022)
  • GRASS GIS many modules for creating predictors
  • QGIS graphical user interface for mouse-guided visualization

dependent variables - presences

  • occurrences of a species in space (and time)
  • geographic accuracy / spatial grain
  • Linear models:
    • check for spatial autocorrelation using Moran's I
    • sampling artifact? remove samples
    • thin points stepwise by most ‘offending’ record
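The autocorrelation check above can be sketched in a few lines. Below is a minimal global Moran's I with inverse-distance weights, in plain Python for illustration (the talk's workflow is in R, where packages such as ape or spdep provide this; the helper name `morans_i` is ours):

```python
def morans_i(coords, values):
    """Global Moran's I with inverse-distance weights (illustrative helper).

    Positive values indicate spatial clustering of similar values,
    negative values indicate dispersion; near zero means no pattern.
    """
    n = len(values)
    zbar = sum(values) / n
    z = [v - zbar for v in values]          # center the values
    num = w_sum = 0.0
    for i in range(n):
        xi, yi = coords[i]
        for j in range(n):
            if i == j:
                continue
            xj, yj = coords[j]
            w = 1.0 / ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5  # inverse distance
            num += w * z[i] * z[j]
            w_sum += w
    return (n / w_sum) * (num / sum(zz ** 2 for zz in z))
```

Records could then be thinned stepwise, dropping the most 'offending' point and recomputing until the statistic is acceptable.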

dependent variables - presences example

Occurrence records

dependent variables - absences

  • optional: pseudo-absences in space (and time)
  • even better, when available: true absences in space (and time)
  • presence:absence records
    • linear models - 1:many
    • machine learning - 1:1
  • distance between presences
    • geographic
    • environmental
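A minimal sketch of pseudo-absence generation, assuming only a bounding box and a minimum geographic distance to any presence (the helper name `sample_pseudoabsences` is hypothetical; in R this is typically done with sf/terra or the sdm package):

```python
import random

def sample_pseudoabsences(presences, n, bbox, min_dist, seed=0):
    """Draw n random background points at least min_dist from any presence.

    presences: list of (x, y) tuples; bbox: (xmin, ymin, xmax, ymax).
    Purely illustrative; real workflows also respect the raster grid.
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = bbox
    out = []
    while len(out) < n:
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        # reject candidates too close to a known presence
        if all(((x - px) ** 2 + (y - py) ** 2) ** 0.5 >= min_dist
               for px, py in presences):
            out.append((x, y))
    return out
```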

dependent variables - absences example

randomly generated pseudo-absence points (in red) added to the occurrence map above

independent variables

  • variables relating to patterns in biotic distribution
    • relevant to your extent and grains
    • require variation
    • the focus is not on factors governing biological processes, but on features which correlate with the known species distribution
  • limitation: compute power

independent variables - examples

domain: continental (e.g. North America)
- maximum and minimum daily temperatures [monthly, 4km]
- precipitation [monthly, 4km]
- hydrologic drainage [millennial, 4km]

domain: regional (e.g. Southern Rockies)
- elevation [millennial, 1km]
- soil classes [millennial, 1km]
- solar radiation [millennial, 1km]
- precipitation form [monthly, 1km]

domain: fine (e.g. McDonald Woods)
- microtopography [decade, 1m]
- water relations [decade, 1m]
- shade [weekly, 1m]
- soils [decade, 1m]

independent variables - specific examples

  • Percent bedrock (rocky, young soils)
  • Elevation (alpine habitat)
  • Bare ground (few others plants?)
  • X-Y coords (alpine zone decreases with latitude)
  • Soil surface pH (calcareous bedrock?)
  • Precipitation as snow (monsoonal influence?)

variance in independent variables

  • explicitly check for variation

  • carefully encode categorical data

  • too much variation may not be useful

  • too little variation may not be useful

  • pilot knock-out studies: use one variable, leaving the others out

  • does this warrant simplifying a variable?

  • t-test the difference in values between presence and absence points
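The presence/absence comparison above can be done with Welch's t statistic (R's t.test uses the same unequal-variance form by default); a minimal plain-Python sketch:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances.

    Large |t| suggests the variable differs between presence (a)
    and absence (b) points, i.e. it may be a useful predictor.
    """
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)
```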

modelling approaches - overview

  • ensembles pt. I & II
  • linear models
    • assumptions - dependent variables (DV)
    • assumptions - independent variables (IV)
    • modelling
      • model evaluation
  • machine learning
    • assumptions?
    • modelling
      • model evaluation
  • ensembles pt. III

ensembles pt I

problems with all models
- garbage in -> garbage out
- influential outliers

with machine learning;
- models can fixate on these observations

solution:
- run many models, synthesize the results



“we are stronger together than we are alone” - Walter Payton

ensembles pt II

weak learners: many simple decision-tree models are combined into a single output

bootstrap aggregation (bagging), e.g. random forests
- many models run independently of one another

boosting, e.g. boosted regression trees
- many models run sequentially, each focusing on correcting the errors of the previous iteration

stacking
- combines the consensus output from bagged, boosted, or traditional linear models
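To make "weak learners" concrete, here is a toy bagging ensemble: decision stumps fit to bootstrap resamples, combined by majority vote. This is a plain-Python sketch; real analyses would use R packages such as randomForest or gbm.

```python
import random

def fit_stump(X, y):
    """Fit the best single-feature threshold classifier (a 'weak learner')."""
    best = None
    for j in range(len(X[0])):                      # each feature
        for t in sorted({row[j] for row in X}):     # each candidate threshold
            for sign in (1, -1):                    # direction of the split
                pred = [1 if sign * (row[j] - t) > 0 else 0 for row in X]
                acc = sum(p == yy for p, yy in zip(pred, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) > 0 else 0

def bag(X, y, n_models=25, seed=1):
    """Bootstrap-aggregate stumps; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]    # bootstrap resample
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: round(sum(m(row) for m in models) / n_models)
```

Each stump alone is a weak model; averaged over many bootstrap resamples, the ensemble is far more stable.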

linear models

  • commonly implemented:
    • generalized linear models (glms)
    • generalized additive models (gams)

linear models - assumptions DV

  • distinct records, e.g. no duplicates of herbarium specimens
  • one record per cell of gridded surface

linear models - assumptions IV

  • variance inflation (vifstep or vifcor)
  • identify correlated variables
  • pilot knock-out studies: use one variable of the set as a predictor, leaving the others out
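A sketch of vifcor-style screening: flag predictor pairs whose pairwise correlation exceeds a threshold. Plain Python for illustration; `correlated_pairs` is a hypothetical helper, not the usdm API.

```python
def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def correlated_pairs(layers, threshold=0.7):
    """Flag predictor pairs with |r| above threshold (vifcor-style screen).

    layers: dict mapping predictor name -> list of cell values.
    One member of each flagged pair would then be dropped or knocked out.
    """
    names = list(layers)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(layers[a], layers[b])) > threshold
    ]
```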

correlated!

modelling process

most evaluation is performed by the computer: there is too much information to sift through by hand

machine learning

much more common approach than individual linear models

many ‘weak learners’

species distributions are generally too complex for individual predictors, and building fully interactive terms would take a long time.

the typical approach since the late 1990s

let the algorithms do the work for you

machine learning - assumptions

essentially none; gather observations, the more the merrier.

modelling

train/test split (partition data)

  • no free lunch

    • no silver-bullet machine learning algorithm
    • each works better than the others under different circumstances
  • try many types of models, select some that work for your application

  • common algorithms:

    • maximum entropy (maxent)
    • random forest (rf)
    • boosting (brt)
    • support vector machine (svm)
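The train/test partition can be as simple as a seeded random shuffle (a sketch; caret's createDataPartition provides stratified versions of this in R, and `train_test_split` here is our own helper name):

```python
import random

def train_test_split(records, test_frac=0.3, seed=42):
    """Randomly partition records into a training and a held-out test set.

    The test set is never touched during model fitting; it is reserved
    for evaluating the final (possibly ensembled) model.
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]
```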

evaluation I

Practitioners are always wrong.

How do you want to be wrong?

the downsides of predicting suitable habitat where it isn’t?

the downsides of predicting non-suitable habitat where it really is?

What is the cost of ‘better’ predictions?

other projects? priorities? deadlines?

evaluation II

\[ Accuracy = \frac{\text{correct classifications}}{\text{all classifications}} \]

\[ Sensitivity = \frac{\text{true positives}}{\text{true positives + false negatives}} \] the probability of the method giving a positive result when the test subject is positive.

\[ Specificity = \frac{\text{true negatives}}{\text{true negatives + false positives}} \] the probability of the method giving a negative result when the test subject is negative.
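All three formulas come straight from confusion-matrix counts; a minimal sketch:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity from confusion-matrix counts.

    tp: true positives, fp: false positives,
    tn: true negatives, fn: false negatives.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```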

evaluation III

  • Area Under the ROC Curve (AUC)
  • True Skill Statistic (TSS)
    • robust to unequal sample sizes, e.g. when evaluating many models at once
  • Cohen’s kappa
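TSS and Cohen's kappa both derive from the same confusion-matrix counts: TSS is sensitivity + specificity - 1, and kappa compares observed agreement against agreement expected by chance. A minimal sketch:

```python
def tss(tp, fp, tn, fn):
    """True Skill Statistic = sensitivity + specificity - 1 (range -1 to 1)."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

def cohens_kappa(tp, fp, tn, fn):
    """Cohen's kappa: observed accuracy corrected for chance agreement."""
    n = tp + fp + tn + fn
    po = (tp + tn) / n                                    # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2  # expected
    return (po - pe) / (1 - pe)
```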

evaluation IV - example

AUC TSS Kappa
0.997 0.969 0.945
0.994 0.969 0.945
0.995 0.945 0.910

ensembles pt III

  • several R packages offer stacking of many models, weighting them by your chosen evaluation criteria
  • the ‘test’ partition of your data is used to evaluate this final model

predicting a model into space

  • any model based on values present in gridded (raster) data can be predicted onto a new raster surface
  • each covariate in the model is generally a single raster layer
  • R packages, such as terra, do all the work for you
  • accordingly, species distribution models produce a map as a product
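Cell-wise prediction over aligned layers is the whole trick; a minimal sketch (`predict_surface` is a hypothetical stand-in for what terra handles for you in R):

```python
def predict_surface(model, layers):
    """Apply a fitted model to every cell of aligned raster layers.

    layers: list of 2-D grids (lists of rows), one per covariate,
    all with identical dimensions. Returns a grid of predictions,
    i.e. the suitability map that an SDM produces.
    """
    rows, cols = len(layers[0]), len(layers[0][0])
    return [
        [model([layer[r][c] for layer in layers]) for c in range(cols)]
        for r in range(rows)
    ]
```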

interpretation

  • percent suitability of habitat
  • for each cell as a whole, not the proportion of the cell that is suitable
  • conditional on your model

interpretation - example

suitability

computer limitations

tips and tricks

  • keep a lab notebook; this is bench science

  • always start models small (avoid computer crashes)

  • use a clear and consistent directory structure

  • scratch paper, whiteboards, flowcharts

  • modular workflows; import/export intermediate data between steps

  • track code on github

objectives - recapped

using known occurrences of a species, identify areas which have similar habitat and the potential to support populations

mutualisms?

modelling two species together, joint-SDMs

competition?

using suitability surfaces to plant assemblages of species in field experiments

dispersal?

using gridded surfaces to model the probability of dispersal from known to suitable habitat

conclusion

  • species distribution models are very simple!
  • fun introduction to simple machine learning
  • represent a hypothesis of the probability of suitable habitat
  • new avenues (J-SDMs) can include mutualisms
  • stacked species distributions (S-SDMs) for predicting ecological assemblages

contact info

- github/sagesteppe

some extra info

modelling resources

two-hour discussion of the ‘sdm’ package by one of its authors

large repository for high throughput modelling

large repository about spatial data in R

short activity using a sdm like process to teach spatial data

modelling ensemble learning

Ensemble learning utilizes many sets of trees, each tree composed of many binary decisions, to create a single model. Each independent variable (or feature) may become a node on a tree, i.e. a location where a binary decision moves an observation towards a predicted outcome. Each decision tree which ensemble learning utilizes is a weak model, which may suffer from high variance or bias, but which produces better outcomes than would be expected by chance. When ensembled, these weak models generate a strong model, one which should have more appropriately balanced variance and bias, and whose predicted outcomes correlate more strongly with the expected values than those of the individual weak models.

modelling random forest

In Random Forest (RF), the training data are repeatedly bootstrap re-sampled, in combination with random subsets of features, to create nodes which attempt to optimally predict a known outcome. A large number of trees are then aggregated, via the most common prediction, to generate a final classification. Each individual tree is grown independently of the others.

modelling boosted regression trees

In Boosted Regression Trees (BRT, or gradient boosted trees), an initial tree is grown and all subsequent trees are derived sequentially from it; as each new tree is grown, the errors in the responses from the previous tree are weighted more heavily, so that the model focuses on selecting independent variables which refine predictions. All response data and predictor variables remain available to every tree.

citations

Lee-Yaw, Julie A., Jenny L. McCune, Samuel Pironon, and Seema N. Sheth. 2022. “Species Distribution Models Rarely Predict the Biology of Real Populations.” Ecography 2022 (6): e05877.

Hijmans, Robert J. 2022. Terra: Spatial Data Analysis. https://CRAN.R-project.org/package=terra.

Kuhn, Max. 2022. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.

Liu, Chunlong, Christian Wolter, Weiwei Xian, and Jonathan M Jeschke. 2020. “Species Distribution Models Have Limited Spatial Transferability for Invasive Species.” Ecology Letters 23 (11): 1682–92.

Naimi, Babak, and Miguel B. Araujo. 2016. “sdm: A Reproducible and Extensible R Platform for Species Distribution Modelling.” Ecography 39: 368–75. https://doi.org/10.1111/ecog.01881.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.